{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Defining and customising how record comparisons are made\n",
    "\n",
    "A key feature of Splink is the ability to customise how record comparisons are made - that is, how similarity is defined for different data types.  For example, the definition of similarity that is appropriate for a date of birth field is different than for a first name field.\n",
    "\n",
    "By tailoring the definitions of similarity, linking models are more effectively able to distinguish between different gradations of similarity, leading to more accurate data linking models.\n",
    "\n",
    "## `Comparisons` and `ComparisonLevels`\n",
    "\n",
    "[Recall that](./comparisons_and_comparison_levels.md) a Splink model contains a collection of `Comparisons` and `ComparisonLevels` organised in a hierarchy.  \n",
    "\n",
    "Each `ComparisonLevel` defines the different gradations of similarity that make up a `Comparison`.\n",
    "\n",
    "An example is as follows:\n",
    "\n",
    "```\n",
    "Data Linking Model\n",
    "├─-- Comparison: Date of birth\n",
    "│    ├─-- ComparisonLevel: Exact match\n",
    "│    ├─-- ComparisonLevel: Up to one character difference\n",
    "│    ├─-- ComparisonLevel: Up to three character difference\n",
    "│    ├─-- ComparisonLevel: All other\n",
    "├─-- Comparison: Name\n",
    "│    ├─-- ComparisonLevel: Exact match on first name and surname\n",
    "│    ├─-- ComparisonLevel: Exact match on first name\n",
    "│    ├─-- etc.\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Three ways of specifying Comparisons\n",
    "\n",
    "In Splink, there are three ways of specifying `Comparisons`:\n",
    "\n",
    "- Using 'out-of-the-box' `Comparison`s (Most simple/succinct)\n",
    "- Composing pre-defined `ComparisonLevels` \n",
    "- Writing a full dictionary spec of a `Comparison` by hand (most verbose/flexible)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<hr>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Method 1: Using the `ComparisonLibrary` \n",
    "\n",
    "The `ComparisonLibrary`  contains pre-baked similarity functions that cover many common use cases.\n",
    "\n",
    "These functions generate an entire `Comparison`, composed of several `ComparisonLevels`.\n",
    "\n",
    "You can find a listing of all available `Comparison`s at the page for its API documentation [here](../../api_docs/comparison_library.md)\n",
    "\n",
    "\n",
    "The following provides an example of using the `ExactMatch` `Comparison`, and producing the description (with associated SQL) for the `duckdb` backend:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Comparison 'ExactMatch' of \"first_name\".\n",
      "Similarity is assessed using the following ComparisonLevels:\n",
      "    - 'first_name is NULL' with SQL rule: \"first_name_l\" IS NULL OR \"first_name_r\" IS NULL\n",
      "    - 'Exact match on first_name' with SQL rule: \"first_name_l\" = \"first_name_r\"\n",
      "    - 'All other comparisons' with SQL rule: ELSE\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import splink.comparison_library as cl\n",
    "\n",
    "first_name_comparison = cl.ExactMatch(\"first_name\")\n",
    "print(first_name_comparison.get_comparison(\"duckdb\").human_readable_description)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that, under the hood, these functions generate a Python dictionary, which conforms to the underlying `.json` specification of a model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'output_column_name': 'first_name',\n",
       " 'comparison_levels': [{'sql_condition': '\"first_name_l\" IS NULL OR \"first_name_r\" IS NULL',\n",
       "   'label_for_charts': 'first_name is NULL',\n",
       "   'is_null_level': True},\n",
       "  {'sql_condition': '\"first_name_l\" = \"first_name_r\"',\n",
       "   'label_for_charts': 'Exact match on first_name'},\n",
       "  {'sql_condition': 'ELSE', 'label_for_charts': 'All other comparisons'}],\n",
       " 'comparison_description': 'ExactMatch'}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "first_name_comparison.get_comparison(\"duckdb\").as_dict()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now generate a second, more complex comparison using one of our data-specific comparisons, the `PostcodeComparison`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Comparison 'PostcodeComparison' of \"postcode\".\n",
      "Similarity is assessed using the following ComparisonLevels:\n",
      "    - 'postcode is NULL' with SQL rule: \"postcode_l\" IS NULL OR \"postcode_r\" IS NULL\n",
      "    - 'Exact match on full postcode' with SQL rule: \"postcode_l\" = \"postcode_r\"\n",
      "    - 'Exact match on sector' with SQL rule: NULLIF(regexp_extract(\"postcode_l\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '') = NULLIF(regexp_extract(\"postcode_r\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]? [0-9]', 0), '')\n",
      "    - 'Exact match on district' with SQL rule: NULLIF(regexp_extract(\"postcode_l\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '') = NULLIF(regexp_extract(\"postcode_r\", '^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?', 0), '')\n",
      "    - 'Exact match on area' with SQL rule: NULLIF(regexp_extract(\"postcode_l\", '^[A-Za-z]{1,2}', 0), '') = NULLIF(regexp_extract(\"postcode_r\", '^[A-Za-z]{1,2}', 0), '')\n",
      "    - 'All other comparisons' with SQL rule: ELSE\n",
      "\n"
     ]
    }
   ],
   "source": [
    "pc_comparison = cl.PostcodeComparison(\"postcode\")\n",
    "print(pc_comparison.get_comparison(\"duckdb\").human_readable_description)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a deep dive on out of the box comparisons, see the dedicated [topic guide](./out_of_the_box_comparisons.ipynb).\n",
    "\n",
    "Comparisons can be further configured using the `.configure()` method - full API docs [here](../../api_docs/comparison_library.md#splink.internals.comparison_creator.ComparisonCreator.configure).\n",
    "\n",
    "<hr>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Method 2: `ComparisonLevels`\n",
    "\n",
    "`ComparisonLevels` provide a lower-level API that allows you to compose your own comparisons.\n",
    "\n",
    "For example, the user may wish to specify a comparison that has levels for a match on soundex and jaro_winkler of the `first_name` field.  \n",
    "\n",
    "The below example assumes the user has derived a column `soundex_first_name` which contains the soundex of the first name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Comparison 'CustomComparison' of \"first_name\" and \"soundex_first_name\".\n",
      "Similarity is assessed using the following ComparisonLevels:\n",
      "    - 'first_name is NULL' with SQL rule: \"first_name_l\" IS NULL OR \"first_name_r\" IS NULL\n",
      "    - 'Exact match on first_name' with SQL rule: \"first_name_l\" = \"first_name_r\"\n",
      "    - 'Exact match on soundex_first_name' with SQL rule: \"soundex_first_name_l\" = \"soundex_first_name_r\"\n",
      "    - 'All other comparisons' with SQL rule: ELSE\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from splink.comparison_library import CustomComparison\n",
    "import splink.comparison_level_library as cll\n",
    "\n",
    "custom_name_comparison = CustomComparison(\n",
    "    output_column_name=\"first_name\",\n",
    "    comparison_levels=[\n",
    "        cll.NullLevel(\"first_name\"),\n",
    "        cll.ExactMatchLevel(\"first_name\").configure(tf_adjustment_column=\"first_name\"),\n",
    "        cll.ExactMatchLevel(\"soundex_first_name\").configure(\n",
    "            tf_adjustment_column=\"soundex_first_name\"\n",
    "        ),\n",
    "        cll.ElseLevel(),\n",
    "    ],\n",
    ")\n",
    "\n",
    "print(custom_name_comparison.get_comparison(\"duckdb\").human_readable_description)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This can now be specified in the settings dictionary as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "from splink import SettingsCreator, block_on\n",
    "\n",
    "settings = SettingsCreator(\n",
    "    link_type=\"dedupe_only\",\n",
    "    blocking_rules_to_generate_predictions=[\n",
    "        block_on(\"first_name\"),\n",
    "        block_on(\"surname\"),\n",
    "    ],\n",
    "    comparisons=[\n",
    "        custom_name_comparison,\n",
    "        cl.LevenshteinAtThresholds(\"dob\", [1, 2]),\n",
    "    ],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To inspect the custom comparison as a dictionary, you can call `custom_name_comparison.get_comparison(\"duckdb\").as_dict()`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that `ComparisonLevels` can be further configured using the `.configure()` method - full API documentation [here](../../api_docs/comparison_level_library.md#splink.internals.comparison_creator.ComparisonLevelCreator.configure)  \n",
    "\n",
    "<hr>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Method 3: Providing the spec as a dictionary\n",
    "\n",
    "Behind the scenes in Splink, all `Comparisons` are eventually turned into a dictionary which conforms to [the formal `jsonschema` specification of the settings dictionary](https://github.com/moj-analytical-services/splink/blob/master/splink/files/settings_jsonschema.json) and [here](https://moj-analytical-services.github.io/splink/).\n",
    "\n",
    "The library functions described above are convenience functions that provide a shorthand way to produce valid dictionaries.\n",
    "\n",
    "For maximum control over your settings, you can specify your comparisons as a dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "comparison_first_name = {\n",
    "    \"output_column_name\": \"first_name\",\n",
    "    \"comparison_levels\": [\n",
    "        {\n",
    "            \"sql_condition\": \"first_name_l IS NULL OR first_name_r IS NULL\",\n",
    "            \"label_for_charts\": \"Null\",\n",
    "            \"is_null_level\": True,\n",
    "        },\n",
    "        {\n",
    "            \"sql_condition\": \"first_name_l = first_name_r\",\n",
    "            \"label_for_charts\": \"Exact match\",\n",
    "            \"tf_adjustment_column\": \"first_name\",\n",
    "            \"tf_adjustment_weight\": 1.0,\n",
    "            \"tf_minimum_u_value\": 0.001,\n",
    "        },\n",
    "        {\n",
    "            \"sql_condition\": \"dmeta_first_name_l = dmeta_first_name_r\",\n",
    "            \"label_for_charts\": \"Exact match\",\n",
    "            \"tf_adjustment_column\": \"dmeta_first_name\",\n",
    "            \"tf_adjustment_weight\": 1.0,\n",
    "        },\n",
    "        {\n",
    "            \"sql_condition\": \"jaro_winkler_sim(first_name_l, first_name_r) > 0.8\",\n",
    "            \"label_for_charts\": \"Exact match\",\n",
    "            \"tf_adjustment_column\": \"first_name\",\n",
    "            \"tf_adjustment_weight\": 0.5,\n",
    "            \"tf_minimum_u_value\": 0.001,\n",
    "        },\n",
    "        {\"sql_condition\": \"ELSE\", \"label_for_charts\": \"All other comparisons\"},\n",
    "    ],\n",
    "}\n",
    "\n",
    "settings = SettingsCreator(\n",
    "    link_type=\"dedupe_only\",\n",
    "    blocking_rules_to_generate_predictions=[\n",
    "        block_on(\"first_name\"),\n",
    "        block_on(\"surname\"),\n",
    "    ],\n",
    "    comparisons=[\n",
    "        comparison_first_name,\n",
    "        cl.LevenshteinAtThresholds(\"dob\", [1, 2]),\n",
    "    ],\n",
    ")\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Examples\n",
    "\n",
    "Below are some examples of how you can define the same comparison, but through different methods."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### Exact match Comparison with Term-Frequency Adjustments\n",
    "\n",
    "\n",
    "\n",
    "===+ \"Comparison Library\"\n",
    "\n",
    "    ```py\n",
    "    import splink.comparison_library as cl\n",
    "\n",
    "    first_name_comparison = cl.ExactMatch(\"first_name\").configure(\n",
    "        term_frequency_adjustments=True\n",
    "    )\n",
    "    ```\n",
    "\n",
    "=== \"Comparison Level Library\"\n",
    "\n",
    "    ```py\n",
    "    import splink.comparison_level_library as cll\n",
    "\n",
    "    first_name_comparison = cl.CustomComparison(\n",
    "        output_column_name=\"first_name\",\n",
    "        comparison_description=\"Exact match vs. anything else\",\n",
    "        comparison_levels=[\n",
    "            cll.NullLevel(\"first_name\"),\n",
    "            cll.ExactMatchLevel(\"first_name\").configure(tf_adjustment_column=\"first_name\"),\n",
    "            cll.ElseLevel(),\n",
    "        ],\n",
    "    )\n",
    "    ```\n",
    "    \n",
    "=== \"Settings Dictionary\"\n",
    "\n",
    "    ```py\n",
    "    first_name_comparison = {\n",
    "        'output_column_name': 'first_name',\n",
    "        'comparison_levels': [\n",
    "            {\n",
    "                'sql_condition': '\"first_name_l\" IS NULL OR \"first_name_r\" IS NULL',\n",
    "                'label_for_charts': 'Null',\n",
    "                'is_null_level': True\n",
    "            },\n",
    "            {\n",
    "                'sql_condition': '\"first_name_l\" = \"first_name_r\"',\n",
    "                'label_for_charts': 'Exact match',\n",
    "                'tf_adjustment_column': 'first_name',\n",
    "                'tf_adjustment_weight': 1.0\n",
    "            },\n",
    "            {\n",
    "                'sql_condition': 'ELSE', \n",
    "                'label_for_charts': 'All other comparisons'\n",
    "            }],\n",
    "        'comparison_description': 'Exact match vs. anything else'\n",
    "    }\n",
    "\n",
    "    ```\n",
    "Each of which gives\n",
    "\n",
    "```json\n",
    "{\n",
    "    'output_column_name': 'first_name',\n",
    "    'comparison_levels': [\n",
    "        {\n",
    "            'sql_condition': '\"first_name_l\" IS NULL OR \"first_name_r\" IS NULL',\n",
    "            'label_for_charts': 'Null',\n",
    "            'is_null_level': True\n",
    "        },\n",
    "        {\n",
    "            'sql_condition': '\"first_name_l\" = \"first_name_r\"',\n",
    "            'label_for_charts': 'Exact match',\n",
    "            'tf_adjustment_column': 'first_name',\n",
    "            'tf_adjustment_weight': 1.0\n",
    "        },\n",
    "        {\n",
    "            'sql_condition': 'ELSE', \n",
    "            'label_for_charts': 'All other comparisons'\n",
    "        }],\n",
    "    'comparison_description': 'Exact match vs. anything else'\n",
    "}\n",
    "```\n",
    "in your settings dictionary."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Levenshtein Comparison\n",
    "\n",
    "\n",
    "\n",
    "===+ \"Comparison Library\"\n",
    "\n",
    "    ```py\n",
    "    import splink.comparison_library as cl\n",
    "\n",
    "    email_comparison = cl.LevenshteinAtThresholds(\"email\", [2, 4])\n",
    "    ```\n",
    "\n",
    "=== \"Comparison Level Library\"\n",
    "\n",
    "    ```py\n",
    "    import splink.comparison_library as cl\n",
    "    import splink.comparison_level_library as cll\n",
    "\n",
    "    email_comparison = cl.CustomComparison(\n",
    "        output_column_name=\"email\",\n",
    "        comparison_description=\"Exact match vs. Email within levenshtein thresholds 2, 4 vs. anything else\",\n",
    "        comparison_levels=[\n",
    "            cll.NullLevel(\"email\"),\n",
    "            cll.LevenshteinLevel(\"email\", distance_threshold=2),\n",
    "            cll.LevenshteinLevel(\"email\", distance_threshold=4),\n",
    "            cll.ElseLevel(),\n",
    "        ],\n",
    "    )\n",
    "    ```\n",
    "\n",
    "=== \"Settings Dictionary\"\n",
    "\n",
    "    ```py\n",
    "    email_comparison = {\n",
    "        'output_column_name': 'email',\n",
    "        'comparison_levels': [{'sql_condition': '\"email_l\" IS NULL OR \"email_r\" IS NULL',\n",
    "        'label_for_charts': 'Null',\n",
    "        'is_null_level': True},\n",
    "        {\n",
    "            'sql_condition': '\"email_l\" = \"email_r\"',\n",
    "            'label_for_charts': 'Exact match'\n",
    "        },\n",
    "        {\n",
    "            'sql_condition': 'levenshtein(\"email_l\", \"email_r\") <= 2',\n",
    "            'label_for_charts': 'Levenshtein <= 2'\n",
    "        },\n",
    "        {\n",
    "            'sql_condition': 'levenshtein(\"email_l\", \"email_r\") <= 4',\n",
    "            'label_for_charts': 'Levenshtein <= 4'\n",
    "        },\n",
    "        {\n",
    "            'sql_condition': 'ELSE', \n",
    "            'label_for_charts': 'All other comparisons'\n",
    "        }],\n",
    "        'comparison_description': 'Exact match vs. Email within levenshtein thresholds 2, 4 vs. anything else'}\n",
    "    ```\n",
    "\n",
    "Each of which gives\n",
    "\n",
    "```json\n",
    "{\n",
    "    'output_column_name': 'email',\n",
    "    'comparison_levels': [\n",
    "        {\n",
    "            'sql_condition': '\"email_l\" IS NULL OR \"email_r\" IS NULL',\n",
    "            'label_for_charts': 'Null',\n",
    "            'is_null_level': True},\n",
    "        {\n",
    "            'sql_condition': '\"email_l\" = \"email_r\"',\n",
    "            'label_for_charts': 'Exact match'\n",
    "        },\n",
    "        {\n",
    "            'sql_condition': 'levenshtein(\"email_l\", \"email_r\") <= 2',\n",
    "            'label_for_charts': 'Levenshtein <= 2'\n",
    "        },\n",
    "        {\n",
    "            'sql_condition': 'levenshtein(\"email_l\", \"email_r\") <= 4',\n",
    "            'label_for_charts': 'Levenshtein <= 4'\n",
    "        },\n",
    "        {\n",
    "            'sql_condition': 'ELSE', \n",
    "            'label_for_charts': 'All other comparisons'\n",
    "        }],\n",
    "    'comparison_description': 'Exact match vs. Email within levenshtein thresholds 2, 4 vs. anything else'\n",
    "}\n",
    "```\n",
    "\n",
    "in your settings dictionary."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.2 ('splink-venv': venv)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  },
  "vscode": {
   "interpreter": {
    "hash": "a48c72d7d87a1a03872c2328bc7c3d312b8461ea67915604a878c0e034a87923"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
